Semantic search and response

ABSTRACT

An approach to information retrieval for facilitating semantic search and response over a large domain of technical documents is disclosed. First, the grammar and morphology of the statements and instructions expressed in the technical documents are used to filter training data to extract the text that is most information-rich, that is, the text that contains domain-specific jargon in context. This training data is then vectorized and fed as input to an SBERT neural network model that learns an embedding of related words and terms in the text, i.e. the relationship between a given set of words contained in a user's query and the instructions from the technical documentation text most likely to assist in the user's operations. There are two parsing tasks. The first is to select a minimal sample of sentences from the document corpus that capture the domain-specific terminology (jargon). The result is a set of sentences used to train BERT and SBERT. The second parsing task is to create a set of action-trigger phrases from the document corpus. The trigger potentially matches a user query and the action is the related task.

FIELD OF THE DISCLOSURE

The present application relates generally to information retrieval, and more particularly to facilitating semantic search and response.

BACKGROUND

Commercial information retrieval systems have evolved significantly in the last several decades from search engines that introduced graph-based algorithms for sifting through millions of webpages to return relevant responses, to NoSQL databases that integrate document search and retrieval as first-class features, to cognitive systems that aim to transform documents into interactive question and answer applications. The primary challenge of building language data products with off-the-shelf tools is that language data is incredibly complex, producing high-dimensional, sparse vectors that present significant encoding challenges. In addition, as opposed to formal language, natural language encodes meaning not in individual tokens but contextually. For example, the term “ship” can function as both a verb and a noun, and depending on context it could be a synonym for the term “transport” or an acronym for a longer noun phrase (e.g. “software hint implementation proposal”). In the construction of context and domain specific language products, these contextual semantics must be preserved to successfully automate technical question and answer systems. Therefore, improvements in semantic searching and response are desired.

SUMMARY

In a first aspect of the present invention, a method of processing a document corpus includes receiving a corpus of documents; selecting a set of sentences from the corpus and creating a corpus-sentence set; processing the corpus-sentence set using BERT to create word-level association; processing the corpus-sentence set using SBERT to create sentence-level association; and creating a neural network for the corpus using the set of sentences, word-level associations and sentence-level associations.

There are two parsing tasks. The first is to select a minimal sample of sentences from the document corpus that capture the domain-specific terminology (jargon). The result is a set of sentences used to train BERT and SBERT. The second parsing task is to create a set of action-trigger phrases from the document corpus. The trigger potentially matches a user query and the action is the related task.

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features that are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.

BRIEF DESCRIPTION OF THE FIGURES

For a more complete understanding of the disclosed system and methods, reference is now made to the following descriptions taken in conjunction with the accompanying drawings.

FIG. 1 is a block diagram illustrating a semantic search and response system, according to one embodiment of the present invention;

FIG. 2 is a screen shot illustrating the results of a search, according to one embodiment of the present invention;

FIG. 3 is a screen shot of details of a part of the search of FIG. 2, according to one embodiment of the present invention;

FIG. 4 is a flow diagram of the overall processing of the text selector of FIG. 1, according to one example embodiment of the present invention;

FIG. 5 is a more detailed flow diagram of the creation of the sentence set from FIG. 4, according to one example embodiment of the present invention;

FIG. 6 is a more detailed flow diagram of the processing step of a portion of FIG. 5, according to one example embodiment of the present invention;

FIG. 7 is a more detailed flow diagram for processing the glossary, according to one example embodiment of the present invention;

FIG. 8 is a more detailed flow diagram for processing the table of contents, according to one example embodiment of the present invention;

FIG. 9 is a more detailed flow diagram for processing indexes, according to one example embodiment of the present invention;

FIG. 10 is a more detailed flow diagram for processing tables, according to one example embodiment of the present invention;

FIG. 11 is a flow diagram, according to one example embodiment of the present invention;

FIG. 12 is a flow diagram, according to one example embodiment of the present invention;

FIG. 13 is a flow diagram, according to one example embodiment of the present invention;

FIG. 14 is a flow diagram, according to one example embodiment of the present invention;

FIG. 15 is a flow diagram, according to one example embodiment of the present invention;

FIG. 16 is a flow diagram, according to one example embodiment of the present invention;

FIG. 17 is a flow diagram, according to one example embodiment of the present invention;

FIG. 18 is a flow diagram, according to one example embodiment of the present invention;

FIG. 19 is a flow diagram, according to one example embodiment of the present invention;

FIG. 20 is a flow diagram, according to one example embodiment of the present invention;

FIG. 21 is a flow diagram, according to one example embodiment of the present invention;

FIG. 22 is a flow diagram, according to one example embodiment of the present invention;

FIG. 23 is a flow diagram, according to one example embodiment of the present invention;

FIG. 24 is a flow diagram, according to one example embodiment of the present invention;

FIG. 25 is a flow diagram, according to one example embodiment of the present invention;

FIG. 26 is a flow diagram, according to one example embodiment of the present invention;

FIG. 27 is a flow diagram, according to one example embodiment of the present invention;

FIG. 28 is a block diagram illustrating a computer network, according to one example embodiment of the present invention;

FIG. 29 is a block diagram illustrating a computer system, according to one example embodiment of the present invention; and

FIG. 30 is an illustration of an RDMS database design, according to one example embodiment of the present invention.

DETAILED DESCRIPTION

In one embodiment, a novel approach to information retrieval for facilitating semantic search and response over a large domain of technical documents is disclosed. First, the grammar and morphology of the statements and instructions expressed in the technical documents are used to filter training data to extract the text that is most information-rich, that is, the text that contains domain-specific jargon in context. This training data is then vectorized and fed as input to an SBERT neural network model that learns an embedding of related words and terms in the text, i.e. the relationship between a given set of words contained in a user's query and the instructions from the technical documentation text most likely to assist in the user's operations.

The query processing system leverages BERT (Bidirectional Encoder Representations from Transformers). BERT includes various techniques for training general purpose language representation models using enormous piles of unannotated text on the web (“pre-training”). These general purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g. when working on problems like question answering and sentiment analysis. BERT generates a representation of each word that is based on the other words in the sentence. BERT considers both the words that follow and the words that precede.

The query processing system also leverages SBERT (Sentence BERT). SBERT uses sentence pair regression tasks like semantic textual similarity (STS). SBERT uses Siamese and triplet network structures. Multiple data sources are passed simultaneously in the same trainable transformer structure. SBERT derives semantically meaningful sentence embeddings that can be compared with cosine-similarity.

There are two parsing tasks. The first is to select a minimal sample of sentences from the document corpus that capture the domain-specific terminology (jargon). The result is a set of sentences used to train BERT and SBERT. The primary sources for domain-specific terms are the glossary, the table of contents, indexes and explicit tables. The second parsing task is to create a set of action-trigger phrases from the document corpus. The trigger potentially matches a user query and the action is the related task. The primary sources include the glossary, table of contents, index, cross references, highlights, explicit tables and implicit tables. As an example, in a glossary, the glossary term would be the action and the chosen sentence would be the trigger. In a table of contents, the action can be the heading text and the trigger can be the next sentence. In an index, the index term can be the action while the sentence containing the index term is the trigger. For cross references, the action can be the cross reference and the trigger can be the cross reference source sentence. SYNSET can be used to find synonyms and cosine similarity technology can be used to find the related text on the target section or page.

Generated triggers are encoded using the pre-trained SBERT model. The resulting embeddings are stored on disk for repeated use. When the user makes a query to the application, the query is encoded using the pre-trained SBERT model. A cosine similarity score is computed between the embedded query and each embedded trigger. The cosine scores are sorted to yield a ranked list of action-trigger phrase sets.
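By way of illustration only, the scoring and ranking step described above can be sketched as follows. The sketch assumes that the query and trigger embeddings have already been produced by the SBERT model; the three-dimensional vectors, the trigger labels and the function names are hypothetical stand-ins and are not taken from the disclosure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_triggers(query_vec, trigger_vecs):
    """Return (score, trigger) pairs sorted by descending cosine score."""
    scored = [
        (cosine_similarity(query_vec, vec), trigger)
        for trigger, vec in trigger_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors standing in for stored trigger embeddings (assumption).
trigger_vecs = {
    "set keyword": [0.9, 0.1, 0.0],
    "delete file": [0.0, 0.2, 0.9],
    "list keywords": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded user query

ranked = rank_triggers(query_vec, trigger_vecs)
for score, trigger in ranked:
    print(f"{score:.3f}  {trigger}")
```

In a deployed embodiment the trigger vectors would be loaded from the stored embeddings rather than defined inline, and a vectorized numerical library could replace the pure-Python arithmetic used here.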

Model metrics can be gathered for each user-entered query. A cosine score of top results is used and the max, min and mean scores are gathered. The user's behavior can also be collected to identify links that were clicked and track related searches in the user session. Metric results can be evaluated to identify low performing queries. The query results can be evaluated offline periodically by a domain expert. Scoring can be used as feedback to retrain the model and the trigger phrases.
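As a minimal sketch of the metric gathering described above, the following computes the max, min and mean cosine score over the top results of each query and flags queries whose best score is low. The cut-off of 10 results, the 0.5 threshold and the function names are assumptions made for illustration; the disclosure does not specify them.

```python
from statistics import mean


def query_metrics(scores, top_k=10):
    """Max, min and mean of the top_k cosine scores for one query."""
    top = sorted(scores, reverse=True)[:top_k]
    if not top:
        return None  # the query produced no results
    return {"max": max(top), "min": min(top), "mean": mean(top)}


def low_performing(metrics_by_query, threshold=0.5):
    """Queries with no results or whose best score falls below threshold."""
    return [
        query
        for query, metrics in metrics_by_query.items()
        if metrics is None or metrics["max"] < threshold
    ]


# Hypothetical scores for two queries.
metrics_by_query = {
    "query A": query_metrics([0.91, 0.85, 0.40]),
    "query B": query_metrics([0.31, 0.22]),
}
print(low_performing(metrics_by_query))
```

The flagged queries could then be passed to the offline review by a domain expert.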

Referring to FIG. 1, an overall architecture for the query processing system 100 is shown. The processing falls into three large groupings: training the neural network model 102 on the domain-specific terminology, or jargon; creating the set of action-trigger phrases 104 for query processing; and processing user queries 106. A text selector 108 selects a minimal sample of sentences that captures the domain-specific terminology from a document corpus 110. The documents in the document corpus 110 might exist in any format such as .pdf, .doc, .xml or other formats. The present disclosure leverages the formalized writing style of the documentation and is agnostic to the container mechanism. The result is a set of sentences 112. The sentences 112 are fed into BERT 114 and SBERT 116 to produce a neural network 118 tailored to the terminology specific to the document corpus 110. The combination of the neural network model 102 and the action-trigger phrases 104 to produce a trained database 126 not only improves the accuracy of the search, but also increases the overall speed and efficiency of the search and reduces the computational overhead of the overall query processing system 100.

An action trigger producer 120 takes documents from the document corpus 110 and, from each document, it selects action-trigger phrases that are potential matches for user queries, resulting in an action-trigger phrase set 122. This resulting set 122 is applied to the neural network 118 at 124 to create a trained database 126. The trained database 126 is trained on the domain-specific terminology and action triggers extracted from the document corpus 110. The query processing tool 128 takes a user query 130, vectorizes the query 130 using the neural network 118 and uses a cosine similarity approach to match against action-triggers in the trained database 126, producing a result report 132.

As an example, a user types a query for a topic, such as “set TPM keyword” at 128. The query processing tool 128 uses a set of action-trigger phrases that supports the neural network's query processing. The query processing tool 128 produces a result report 132. FIG. 2 is an example showing the top 10 results 200 in descending order of the cosine Confidence Score. The “Details” button 202 exposes the sentence and paragraph context of an action-trigger phrase. If the user selects button 202, the tool shows the top 5 document contexts 300 for the action-trigger phrase as shown in FIG. 3.

Referring back to the text selector 108 of FIG. 1, creating a minimal set of sentences that encompasses all of the unique terminology of the document corpus is important to decrease computational time, because BERT requires n*(n−1)/2 computations (where n is the number of sentences from the document corpus 110). The processing creates as output a collection of sentences to pass to BERT processing 114 called the CORPUS-SENTENCE set 112. The goal for populating the CORPUS-SENTENCE set 112 is to capture corpus-specific words and phrases (jargon) in context in a sentence. The processing creates intermediate structures including a CORPUS-TERM set containing the terms discovered in a glossary; a TOC-ENTRY set containing the table of contents entries, each classified as a major or minor heading; and an INDEX-ENTRY set containing the entries in the index, each classified as primary, secondary or tertiary, and the page or section reference for the index entry. These structural and content elements of a document provide the richest set of sentences containing domain-specific terms for training the neural network model 102. These include glossary entries and sentences from the document, index entries and their target sentences, formal tables with their description sentences and document text sentences immediately following major section headings.
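To make the n*(n−1)/2 growth concrete, the short sketch below counts the sentence pairs for a few set sizes. The sizes are illustrative only and are not figures from the disclosure.

```python
def pairwise_computations(n):
    """Number of unordered sentence pairs among n sentences: n*(n-1)/2."""
    return n * (n - 1) // 2


# Illustrative set sizes (assumption): a tenfold reduction in sentences
# cuts the number of pairs roughly a hundredfold.
for n in (1_000, 10_000, 100_000):
    print(n, pairwise_computations(n))
```

Because the count grows quadratically, trimming the CORPUS-SENTENCE set has a disproportionately large effect on processing time.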

FIG. 4 is a flow diagram of the overall processing 400 of the text selector 108 by the processor. The process begins at start 401. At 402, a corpus of documents is received to be processed. In one test case, the document corpus contained 65,000 pages of formal end-user documentation plus 150,000 pages of technical documentation (architecture, design and training materials). Human intervention into the processing was impractical given the volume of information to be parsed. At 404, the processor determines if the document corpus contains a glossary document. If “YES” then at 406 a CORPUS-TERM set is created by the processor from the glossary. If “NO” then at 408 an empty CORPUS-TERM set is created by the processor. At 410, sentences are selected from the corpus. At 412, BERT is used by the processor to create word-level association. At 414, SBERT is used by the processor to create sentence-level association. At 416, a neural network tailored for this document corpus is created and flow ends at 418.

FIG. 5 is a more detailed flow diagram of the creation 500 of the sentence set by the processor at 410 from FIG. 4. Flow starts at 501. At 502, a document is selected from the corpus. At 504, sentences are selected from the document and placed into the CORPUS-SENTENCE set by the processor. At 506, it is determined if there are more documents in the corpus. If “YES” the process repeats at 502. If “NO” then the process ends at 508.

FIG. 6 is a more detailed flow diagram of the processing step of 504 of FIG. 5 by the processor. Processing begins at 601. At 602, it is determined by the processor if the document contains a table of contents. If “YES”, then the table of contents is processed at 604. If “NO”, it is determined by the processor if the document contains a glossary at 606. If “YES”, then the glossary is processed at 608. If “NO”, then it is determined by the processor if the document contains an index at 610. If “YES”, the index is processed at 612. If “NO”, it is determined by the processor if the document contains explicit tables at 614. If “YES”, the explicit tables are processed at 616. If “NO”, the document text is processed at 618. Because individual sentences may be selected multiple times using the table of contents, glossary and index structures, at 620, duplicates are eliminated by the processor. At 622, the CORPUS-SENTENCE set is written for input to the BERT and SBERT processing. Flow ends at 624.

FIG. 7 is a more detailed flow diagram for processing the glossary 700 by the processor. The flow begins at 701. At 702, the next term in the glossary is captured and added to the CORPUS-TERM set. At 704, the definition corresponding to the term is captured, the term is concatenated with the definition into a single sentence, and the result is added to the CORPUS-SENTENCE set by the processor. At 706, it is determined by the processor if more terms exist in the glossary. If “YES” then the process repeats at 702. If “NO” the process ends at 707.
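The glossary loop of FIG. 7 can be sketched as follows, under the assumption that the glossary has already been parsed into (term, definition) pairs. The joining format (term, colon, definition) and the function name are hypothetical; the disclosure states only that the two are concatenated into a single sentence.

```python
def process_glossary(glossary, corpus_terms, corpus_sentences):
    """Add each term to the term set and a term-plus-definition sentence
    to the sentence list, mirroring steps 702 to 706."""
    for term, definition in glossary:
        corpus_terms.add(term)                        # step 702
        sentence = f"{term}: {definition.strip()}"    # step 704 (assumed format)
        if not sentence.endswith("."):
            sentence += "."
        corpus_sentences.append(sentence)


corpus_terms = set()
corpus_sentences = []
glossary = [
    ("BERT", "Bidirectional Encoder Representations from Transformers"),
    ("SBERT", "Sentence BERT"),
]
process_glossary(glossary, corpus_terms, corpus_sentences)
print(corpus_sentences)
```

Keeping the term inside the sentence preserves the jargon in context, which is the stated goal for populating the CORPUS-SENTENCE set.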

FIG. 8 is a more detailed flow diagram for processing the table of contents 800 by the processor. Flow begins at 801. At 802, the next entry in the table of contents is captured. At 804, the table of contents entry is classified as a major or minor heading. At 806, the table of contents entry is captured into the TOC-ENTRY set. At 808, it is determined by the processor if there are more entries in the table of contents. If “YES”, then the process repeats at 802. If “NO” the process ends at 809.

FIG. 9 is a more detailed flow diagram for processing the index entries 900 by the processor. Flow begins at 901. At 902, the next entry in the index is captured. At 904, the entry is classified as primary, secondary, etc. At 906, the index entry and associated page and/or section reference are captured into the INDEX-ENTRY set. At 908, it is determined if there are more entries in the index. If “YES”, the process repeats at 902. If “NO”, the flow ends at 909.

FIG. 10 is a more detailed flow diagram for processing the formal tables 1000, or explicit tables, by the processor. Flow begins at 1001. At 1002, the explicit table is discovered. At 1004, sentences from the “Description” column of the table are captured. To determine which column contains the description information, the processor applies pattern matching and Natural Language Processing (NLP) techniques. First, it looks for the keyword “Description” as a header label. Second, the SYNSET processing technique can be used to identify table header labels that belong to the same SYNSET as “Description”. Third, it looks for sentence structure in the column values and leverages grammar to identify phrases and phrase patterns. It also checks to see if there is a column that contains a majority of phrases/phrase patterns. If so, the processor tentatively chooses it as the description column. Then, the processor analyzes the proportion of verbs across columns. If the column with the highest count of verbs matches the column containing the majority of phrases/phrase patterns, it asserts that column is the description column. If not, it asserts that the column with the highest count of verbs is the description column. At 1006, each captured sentence is added to the CORPUS-SENTENCE set. At 1008, it is determined by the processor if there are more tables in the document. If “YES” then the processing repeats at 1002. If “NO” then processing ends at 1009.
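One possible reading of the description-column heuristic is sketched below. It is a simplification under stated assumptions: a small hard-coded synonym set stands in for a SYNSET lookup, a tiny verb list stands in for part-of-speech tagging, and "phrase" is approximated as a cell of three or more words. None of these stand-ins, nor the function names, come from the disclosure.

```python
# Hypothetical stand-ins for a SYNSET lookup and a part-of-speech tagger.
DESCRIPTION_SYNONYMS = {"description", "explanation", "meaning", "definition"}
TOY_VERBS = {"initializes", "generates", "contains", "sets", "returns", "is"}


def looks_like_phrase(cell):
    """Crude phrase test (assumption): three or more words."""
    return len(cell.split()) >= 3


def count_verbs(cell):
    return sum(1 for word in cell.lower().split() if word.strip(".,;") in TOY_VERBS)


def find_description_column(headers, rows):
    """Return (column_index, heuristics_agree)."""
    # Steps 1 and 2: header label is "Description" or a synonym of it.
    for index, header in enumerate(headers):
        if header.strip().lower() in DESCRIPTION_SYNONYMS:
            return index, True
    columns = range(len(headers))
    # Step 3: tentative choice is the column where most cells look like phrases.
    phrase_counts = [sum(looks_like_phrase(row[i]) for row in rows) for i in columns]
    phrase_col = max(columns, key=lambda i: phrase_counts[i])
    has_majority = phrase_counts[phrase_col] > len(rows) / 2
    # Step 4: the column with the highest verb count decides.
    verb_counts = [sum(count_verbs(row[i]) for row in rows) for i in columns]
    verb_col = max(columns, key=lambda i: verb_counts[i])
    return verb_col, has_majority and phrase_col == verb_col


headers = ["Name", "Details"]
rows = [
    ["DPSIF", "The File Initialization Processor initializes DPS 200 system files."],
    ["DPSPW", "The processor generates information about authorized users."],
]
print(find_description_column(headers, rows))
```

As the text describes it, the verb-count column is selected whether or not it matches the tentative phrase-based choice, so the sketch returns the agreement between the two heuristics as a separate flag that a caller could treat as a confidence signal.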

FIG. 11 is a more detailed flow diagram for processing 1100 of step 618 of FIG. 6 by the processor. Flow begins at 1101. At 1102, the CORPUS-TERM set is processed by the processor if it has members. At 1104, the TOC-ENTRY set is processed if it has members. At 1106, the INDEX-ENTRY set is processed if it has members.

FIG. 12 is a more detailed flow diagram for processing 1200 step 1102 of FIG. 11. Flow begins at 1201. At 1202, the next term from the CORPUS-TERM set is found. At 1204, each sentence in the document that contains the term is found by the processor. At 1206, the sentences are added to the CORPUS-SENTENCE set. At 1208, it is determined if there are more entries in the CORPUS-TERM set. If “YES” the process repeats at 1202. If “NO” the process ends at 1209.

FIG. 13 is a more detailed flow diagram for processing 1300 step 1104 of FIG. 11. Flow begins at 1301. At 1302, the next major heading entry from the TOC-ENTRY set is found. At 1304, the heading is found in text. At 1306, sentences following the heading are selected until a major or minor heading is encountered or a configured sentence-capture-count is reached. At 1308, each selected sentence is added to the CORPUS-SENTENCE set. At 1310, it is determined by the processor if there are more entries in the TOC-ENTRY set. If “YES” then the process repeats at 1302. If “NO” then the process ends at 1311.
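Step 1306 can be sketched as follows, assuming the document text has already been split into a list in which each item is either a heading or a single sentence. The sample headings and sentences, the default count of 3 and the function name are hypothetical.

```python
def sentences_after_heading(lines, heading, all_headings, capture_count=3):
    """Collect sentences after `heading` until another heading is met
    or `capture_count` sentences have been selected."""
    try:
        start = lines.index(heading) + 1
    except ValueError:
        return []  # heading not found in the text
    selected = []
    for line in lines[start:]:
        if line in all_headings or len(selected) >= capture_count:
            break
        selected.append(line)
    return selected


# Toy document (assumption): headings and sentences as separate items.
lines = [
    "1 Overview",
    "The tool parses documents.",
    "It selects sentences.",
    "It trains a model.",
    "It answers queries.",
    "1.1 Scope",
    "Only technical documents are covered.",
]
headings = {"1 Overview", "1.1 Scope"}
print(sentences_after_heading(lines, "1 Overview", headings))
```

A small sentence-capture-count keeps the CORPUS-SENTENCE set minimal while still capturing the sentences most likely to introduce the section's terminology.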

FIG. 14 is a more detailed flow diagram for processing 1400 step 1106 of FIG. 11. Flow begins at 1401. At 1402, the next term from the INDEX-ENTRY set is found. At 1404, the page or section associated with that term is found. At 1406, each sentence on the target page or target section containing the term is selected and at 1408 is added to the CORPUS-SENTENCE set. At 1410, it is determined by the processor if there are more terms in the INDEX-ENTRY set. If “YES”, the process repeats at 1402. If “NO”, the process ends at 1411.

Referring back to the action trigger producer 120 of FIG. 1, writers encode information into documents in a hierarchical manner, whose structure and implied semantic meaning are readily apparent to human readers of the document. Often this information is conveyed using conventions for highlighting (bold, italic, font change, etc.) and indenting. The structure, known by repetition, and the structure's implied semantic meaning are readily apparent to the human reader of the document. Identifying these structures algorithmically is a topic of the present disclosure. The processing creates as output a collection of action-trigger phrase sets in the system architecture of FIG. 1.

In one embodiment, these phrase sets, made up of comma-separated values, have this format: Action, Trigger, Document, Location. The Action is the potential user action or topic, discovered by query processing by the neural network match between the user query and the Trigger. The Trigger is the text against which the neural network compares the user query for a potential match. The Document identifies the document in the corpus. It can include a document title, document number, collection name and document location (for example, a SharePoint location, network file system folder, or corporate database). Each embodiment and each corpus will have specific requirements for document identification. Hereinafter, this is referred to as <document_id>. The Location identifies the location of the Action in the document. It can include a section name and number, table name and number, figure name and number, page number, hyperlink, referenced file and so on. Each embodiment and each corpus will have specific requirements for Action location. Hereinafter, this is referred to as <location>.
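A minimal sketch of the comma-separated format described above follows. The record type, the function name and the second sample row are hypothetical; the <document_id> and <location> values are left as placeholders because each embodiment defines its own.

```python
import csv
import io
from typing import NamedTuple


class ActionTrigger(NamedTuple):
    action: str
    trigger: str
    document_id: str
    location: str


def write_phrase_sets(phrases):
    """Serialize phrase sets as Action, Trigger, Document, Location rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for phrase in phrases:
        writer.writerow(phrase)  # fields containing commas are quoted
    return buffer.getvalue()


phrases = [
    ActionTrigger(
        "DPSIF",
        "The File Initialization Processor initializes DPS 200 system files.",
        "<document_id>",
        "<location>",
    ),
    # Hypothetical row whose trigger contains a comma.
    ActionTrigger(
        "set keyword",
        "To set a keyword, enter the command.",
        "<document_id>",
        "<location>",
    ),
]
csv_text = write_phrase_sets(phrases)
print(csv_text)
```

Using a CSV writer rather than joining strings by hand matters here because trigger sentences frequently contain commas of their own.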

FIG. 15 is a flow diagram illustrating the overall processing 1500 for the action-trigger producer 120 of FIG. 1 by the processor. Flow begins at 1501. At 1502, a corpus of documents to be processed is received. At 1504, it is determined by the processor if the document corpus contains a glossary. If “YES” then a CORPUS-TERM set is created from the glossary at 1506. If “NO”, an empty CORPUS-TERM set is created at 1508. At 1510, ACTION-TRIGGER PHRASE sets are generated from the corpus. At 1512, the neural network tailored for this document corpus is found. At 1514, the neural network is applied to the ACTION-TRIGGER PHRASE sets. At 1516, a resulting trained database tailored for this document corpus is created. Flow ends at 1517. Applying the neural network to the ACTION-TRIGGER PHRASE sets to create a trained database improves search time and efficiency and reduces the computational power necessary to search in response to a user query. It also improves the accuracy of the results of the search.

FIG. 16 is a more detailed flow diagram of the processing 1600 for step 1510 of FIG. 15 by the processor. Flow begins at 1601. At 1602, the next document in the corpus is found. At 1604, text is selected from the document and ACTION-TRIGGER PHRASE sets are created. At 1606, it is determined if more documents exist in the corpus. If “YES”, the process repeats at 1602. If “NO”, flow ends at 1607.

FIG. 17 is a more detailed flow diagram of the processing of step 1604 from FIG. 16 by the processor. Flow begins at 1701. At 1702, it is determined by the processor if the document contains a table of contents. If “YES”, then the table of contents is processed at 1704. If “NO”, it is determined by the processor if the document contains a glossary at 1706. If “YES”, then the glossary is processed at 1708. If “NO”, then it is determined by the processor if the document contains an index at 1710. If “YES”, the index is processed at 1712. If “NO”, the document text is processed at 1718. Because individual sentences may be selected multiple times using the table of contents, glossary and index structures, at 1720, duplicates are eliminated by the processor. At 1722, the ACTION-TRIGGER PHRASE sets are written. Flow ends at 1724.

Referring back to FIG. 7, the same flow process can be used for adding to the ACTION-TRIGGER PHRASE sets. Referring to FIG. 18, the processing is similar to that of FIG. 8, except that at 1808, an action-trigger phrase from the table of contents entry is created. Referring to FIG. 19, the processing is similar to that of FIG. 9, except that at 1908, an action-trigger phrase from the index entry is added to the ACTION-TRIGGER PHRASE set.

FIG. 20 is a more detailed flow diagram of the processing of step 1722 of FIG. 17 by the processor. Flow begins at 2001. At 2002, if the CORPUS-TERM set has members, it is processed by the processor. At 2004, if the TOC-ENTRY set has members, it is processed by the processor. At 2006, if the INDEX-ENTRY set has members, it is processed by the processor. The remaining processing goes through all the sections in the document to search for additional formal structures from which semantic information can be captured automatically. At 2008, the text is processed for cross references by the processor. Sections, paragraphs and sentences with more cross references are more interesting from a semantic value standpoint. At 2010, the text is processed for highlights. This includes font changes, bolding, italicization, pop-ups and so on. At 2012, the text is processed for explicit tables. At 2014, the text is processed for implicit tables. (See FIG. 27 below for implicit table discussion.) Flow ends at 2016.

FIG. 21 is a more detailed flow diagram of the processing 2100 for step 2002 of FIG. 20 by the processor. Flow begins at 2101. At 2102, the next term from the CORPUS-TERM set is found. At 2104, each sentence in the document containing the term is found and sentences are chosen to be captured. From the discovered sentences, the processor chooses sentences that meet all of these criteria: 1) the term appears in the first half of the sentence; 2) the sentence appears in the first half of the paragraph; 3) the term appears in more than one sentence in the section; and 4) the sentence appears in prose text, not in a formal table or implicit table. At 2106, an action-trigger phrase is created from each chosen term and sentence. At 2108, the action-trigger phrase is added to the ACTION-TRIGGER PHRASE set. At 2110, it is determined by the processor if there are more entries in the CORPUS-TERM set. If “YES”, then the process repeats at 2102. If “NO” then flow ends at 2112.
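The four selection criteria of step 2104 can be sketched as follows. The data layout (a section as a list of paragraphs, each with its sentences and a flag marking table content), the character-offset interpretation of "first half of the sentence", the sample sentences and the function name are all assumptions made for illustration.

```python
def choose_sentences(term, section):
    """Return sentences in `section` that satisfy all four criteria."""
    needle = term.lower()
    # Criterion 3: the term appears in more than one sentence of the section.
    occurrences = sum(
        needle in sentence.lower()
        for paragraph in section
        for sentence in paragraph["sentences"]
    )
    if occurrences < 2:
        return []
    chosen = []
    for paragraph in section:
        if paragraph["in_table"]:  # Criterion 4: prose text only
            continue
        sentences = paragraph["sentences"]
        for index, sentence in enumerate(sentences):
            position = sentence.lower().find(needle)
            if position < 0:
                continue
            in_first_half_of_sentence = position < len(sentence) / 2   # Criterion 1
            in_first_half_of_paragraph = index < len(sentences) / 2    # Criterion 2
            if in_first_half_of_sentence and in_first_half_of_paragraph:
                chosen.append(sentence)
    return chosen


# Toy section (assumption).
section = [
    {
        "sentences": [
            "DPSIF initializes the system files.",
            "Run it once after installation.",
            "Operators rarely need to rerun DPSIF.",
        ],
        "in_table": False,
    },
    {"sentences": ["DPSIF is listed here."], "in_table": True},
]
print(choose_sentences("DPSIF", section))
```

Each chosen sentence would then be paired with the term to form an action-trigger phrase at step 2106.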

FIG. 22 is a more detailed flow diagram of the processing 2200 for step 2004 of FIG. 20 by the processor. Flow begins at 2201. At 2202, the next major heading entry from the TOC-ENTRY set is found. At 2204, headings in the text are found. At 2206, an action-trigger phrase is created using the heading text. The processing recognizes several possible information formats that may follow the heading, including indentation and spacing pattern, “format” and “purpose” keywords and their SYNSET equivalents and “format” and “purpose” words and their SYNSET equivalents appearing in sentences. At 2208, the action-trigger phrase is added to the ACTION-TRIGGER PHRASE set. At 2210, it is determined if there are more entries in the TOC-ENTRY set. If “YES”, the process repeats at 2202. If “NO”, flow ends at 2212.

FIG. 23 is a more detailed flow diagram of the processing 2300 for step 2006 of FIG. 20 by the processor. Flow begins at 2301. At 2302, the next term from the INDEX-ENTRY set is found. At 2304, the page or section associated with the term is found. Within that section, the processor captures each unique sentence that contains the index term and creates an action-trigger phrase at 2306. At 2308, the action-trigger phrase is added to the ACTION-TRIGGER PHRASE set. At 2310, it is determined if more terms are in the INDEX-ENTRY set. If “YES”, the process repeats at 2302. If “NO”, the process ends at 2312.

FIG. 24 is a more detailed flow diagram of the processing 2400 for step 2008 of FIG. 20 by the processor. Flow begins at 2401. At 2402, the next cross reference in the text is found. Each cross reference provides valuable semantic information about the importance of target sections in the document. To determine which text is a cross reference, the processor applies pattern matching and Natural Language Processing techniques. At 2404 the cross reference information is captured. At 2406, it is determined by the processor if more cross references are in the text. If “YES”, flow branches to 2402. If “NO” the cross references are sorted by number of target occurrences at 2408. At 2410, the next approved cross reference is found. At 2412, an action-trigger phrase is generated, and at 2414, it is added to the ACTION-TRIGGER PHRASE set. At 2416, it is determined by the processor if there are more approved cross references. If “YES” flow branches to 2410. If “NO” flow ends at 2418.
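The cross-reference pass can be sketched with a simple pattern match, as below. The regular expression (matching "See" followed by a dotted section number), the occurrence threshold standing in for the unspecified approval test, the third sample sentence and the function name are assumptions; a production embodiment would need broader patterns and NLP support.

```python
import re
from collections import Counter

# Assumed pattern: "See 5.1.3" or "see section 5.1.3".
XREF = re.compile(r"\b[Ss]ee\s+(?:[Ss]ection\s+)?(\d+(?:\.\d+)+)")


def cross_reference_phrases(sentences, min_occurrences=1):
    """Return (action, trigger) pairs ordered by how often the target is cited."""
    found = []
    for sentence in sentences:                      # steps 2402 to 2406
        for target in XREF.findall(sentence):
            found.append((target, sentence))
    counts = Counter(target for target, _ in found)
    # Step 2408: sort by number of target occurrences (stable for ties).
    ordered = sorted(found, key=lambda pair: counts[pair[0]], reverse=True)
    # Steps 2410 to 2414: the action is the cross reference, the trigger
    # is the sentence that contains it.
    return [
        (f"See {target}", sentence)
        for target, sentence in ordered
        if counts[target] >= min_occurrences
    ]


sentences = [
    "See 5.1.3 for a complete description of these bits.",
    "See 2.2.5 for a description of Instruction in F0.",
    "The flags are explained elsewhere; see section 5.1.3.",  # hypothetical
]
for action, trigger in cross_reference_phrases(sentences):
    print(action, "|", trigger)
```

Sorting by target occurrences surfaces the sections that the document itself points to most often, consistent with the observation that heavily cross-referenced text carries more semantic value.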

FIG. 25 is a more detailed flow diagram of the processing 2500 for step 2010 of FIG. 20 by the processor. Flow begins at 2501. At 2502, the next sentence from the document is found. At 2504, it is determined if the sentence has highlights. Highlights can include a change in font name or font size, a change from non-bold to bold, or a change from non-italic to italic. Highlights can also include a whole sentence that is italic or bold, or a sentence that contains a pop-up or hyperlink. If “NO”, flow branches to 2502. If “YES”, at 2506 each highlight is captured into an action-trigger phrase and at 2508 it is added to the ACTION-TRIGGER PHRASE set. At 2510, it is determined if there are more sentences in the document. If “YES”, flow branches to 2502. If “NO”, flow ends at 2511.

FIG. 26 is a more detailed flow diagram of the processing for explicit tables 2600, for step 2012 of FIG. 20, by the processor. Flow begins at 2601. At 2602, an explicit table is discovered. At 2604, sentences from the “description” column of the table are captured. At 2606, from each captured sentence and corresponding table, an action-trigger phrase is created, and at 2608 it is added to the ACTION-TRIGGER PHRASE set. At 2610, the processor determines if there are more explicit tables in the document. If “YES”, the process repeats at 2602. If “NO”, the flow ends at 2611.
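A minimal sketch of the explicit-table processing of FIG. 26 follows. It assumes each discovered table has already been parsed into a dictionary with a `header` list, a `rows` list, and an optional `title`; this representation and the function name are hypothetical. The sketch captures the text of the "description" column and keeps the full row alongside it so that a later step can form the action-trigger phrase.

```python
def explicit_table_phrases(tables):
    """tables: list of {"title": str, "header": [str], "rows": [[str]]} (assumed shape)."""
    phrase_set = []
    for table in tables:                                   # 2602 / 2610: each table
        header = [h.strip().lower() for h in table["header"]]
        if "description" not in header:
            continue                                       # no description column to capture
        col = header.index("description")
        for row in table["rows"]:                          # 2604: capture descriptions
            if col < len(row) and row[col].strip():
                phrase_set.append({                        # 2606 / 2608: create and add
                    "table": table.get("title", ""),
                    "sentence": row[col].strip(),
                    "row": row,
                })
    return phrase_set
```

Rows that are shorter than the header or that have an empty description cell are skipped rather than raising an error, since extracted tables are often ragged.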

Implicit tables are an important part of the semantic information in a technical document. Implicit tables are not called out as tables but are inferred from the structure of semantic information in a document; they follow a pattern of indenting, tabbing or spacing that sets information apart from the rest of the text. As an example, there may be an indentation to a first descriptive phrase and a second indentation to a second descriptive phrase:

DPSIF    The File Initialization Processor initializes DPS 200 system files.
DPSPW    The processor generates information about authorized users.

The first descriptive phrase would be the action and the second descriptive phrase would be the trigger. The second descriptive phrase may appear on the same line as the first descriptive phrase or may appear on a next line. The implicit table could also be set apart by bold or unbold text:

-   Bits 0-5 Short_Status_Field (SSF)
-   Contains interrupt status when found in an Interrupt_Control_Stack frame
-   Bits 6-8 Mid-Instruction-Description (MID) Flags
-   See 5.1.3 for a complete description of these bits.
-   Bit 6 Instruction in F0 (INF)
-   See 2.2.5 for a description of Instruction in F0.

As another example, an implicit table may be set apart by bulleted paragraphs versus non-bulleted paragraphs. The key is in looking for repetition or a pattern in the formatting of text to find implicit tables having a first descriptive phrase as the action and a second descriptive phrase as the trigger. Implicit tables can be nested within implicit or explicit tables. Explicit tables can be nested within implicit or explicit tables as well.

FIG. 27 is a more detailed flow diagram of the processing for implicit tables 2700, for step 2014 of FIG. 20 by the processor. Flow begins at 2701. At 2702, an implicit table is discovered. At 2704, column 1 and column 2 phrases are captured. At 2706, from each captured sentence and corresponding table, the processor creates an action-trigger phrase, and at 2708, adds it to the ACTION-TRIGGER PHRASE set. At 2710, the processor determines if there are more implicit tables in the document. If “YES”, the process repeats at 2702. If “NO”, flow ends at 2711.
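One way to approximate the implicit-table discovery of FIG. 27 is to look for a repeated spacing pattern, as sketched below. The heuristic is an assumption made for illustration: a row is taken to be a line whose first phrase is separated from the second by a tab or by two or more spaces, an indented line that follows a row is treated as a continuation of the second phrase, and at least `min_rows` such rows must be found before the block is accepted as a table. The names `ROW` and `implicit_table_rows` are hypothetical. Following the description above, column 1 is recorded as the action and column 2 as the trigger.

```python
import re

# First phrase, then a tab or 2+ spaces, then the second phrase (assumed layout).
ROW = re.compile(r"^(\S+(?: \S+)*?)(?:\t| {2,})(\S.*)$")

def implicit_table_rows(lines, min_rows=2):
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match:                                        # 2704: capture column 1 and column 2
            rows.append([match.group(1), match.group(2).strip()])
        elif rows and line[:1].isspace() and line.strip():
            # Second phrase continues on the next, indented line.
            rows[-1][1] += " " + line.strip()
    if len(rows) < min_rows:                             # no repetition, so not a table
        return []
    # 2706 / 2708: column 1 is the action, column 2 is the trigger.
    return [{"action": action, "trigger": trigger} for action, trigger in rows]
```

This handles only the spacing-based layout; tables set apart by bold versus unbold text or by bulleted versus non-bulleted paragraphs would need formatting information that plain text lines do not carry, and nested tables are not addressed.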

FIG. 28 illustrates one embodiment of a system 2800 for an information system, which may host virtual machines. The system 2800 may include a server 2802, a data storage device 2806, a network 2808, and a user interface device 2810. The server 2802 may be a dedicated server or one server in a cloud computing system. The server 2802 may also be a hypervisor-based system executing one or more guest partitions. The user interface device 2810 may be, for example, a mobile device operated by a tenant administrator. In a further embodiment, the system 2800 may include a storage controller 2804, or storage server configured to manage data communications between the data storage device 2806 and the server 2802 or other components in communication with the network 2808. In an alternative embodiment, the storage controller 2804 may be coupled to the network 2808.

In one embodiment, the user interface device 2810 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a personal digital assistant (PDA) or tablet computer, a smartphone or other mobile communication device having access to the network 2808. The user interface device 2810 may be used to access a web service executing on the server 2802. When the device 2810 is a mobile device, sensors (not shown), such as a camera or accelerometer, may be embedded in the device 2810. When the device 2810 is a desktop computer, the sensors may be embedded in an attachment (not shown) to the device 2810. In a further embodiment, the user interface device 2810 may access the Internet or other wide area or local area network to access a web application or web service hosted by the server 2802 and provide a user interface for enabling a user to enter or receive information.

The network 2808 may facilitate communications of data, such as dynamic license request messages, between the server 2802 and the user interface device 2810. The network 2808 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts which permits two or more computers to communicate.

In one embodiment, the user interface device 2810 accesses the server 2802 through an intermediate server (not shown). For example, in a cloud application the user interface device 2810 may access an application server. The application server may fulfill requests from the user interface device 2810 by accessing a database management system (DBMS). In this embodiment, the user interface device 2810 may be a computer or phone executing a Java application making requests to a JBOSS server executing on a Linux server, which fulfills the requests by accessing a relational database management system (RDMS) on a mainframe server.

FIG. 29 illustrates a computer system 2900 adapted according to certain embodiments of the server 2802 and/or the user interface device 2810. The central processing unit (“CPU”) 2902 is coupled to the system bus 2904. The CPU 2902 may be a general purpose CPU or microprocessor, graphics processing unit (“GPU”), and/or microcontroller. The present embodiments are not restricted by the architecture of the CPU 2902 so long as the CPU 2902, whether directly or indirectly, supports the operations as described herein. The CPU 2902 may execute the various logical instructions according to the present embodiments.

The computer system 2900 also may include random access memory (RAM) 2908, which may be static RAM (SRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), or the like. The computer system 2900 may utilize RAM 2908 to store the various data structures used by a software application. The computer system 2900 may also include read only memory (ROM) 2906 which may be PROM, EPROM, EEPROM, optical storage, or the like. The ROM may store configuration information for booting the computer system 2900. The RAM 2908 and the ROM 2906 hold user and system data, and both the RAM 2908 and the ROM 2906 may be randomly accessed.

The computer system 2900 may also include an input/output (I/O) adapter 2910, a communications adapter 2914, a user interface adapter 2916, and a display adapter 2922. The I/O adapter 2910 and/or the user interface adapter 2916 may, in certain embodiments, enable a user to interact with the computer system 2900. In a further embodiment, the display adapter 2922 may display a graphical user interface (GUI) associated with a software or web-based application on a display device 2924, such as a monitor or touch screen.

The I/O adapter 2910 may couple one or more storage devices 2912, such as one or more of a hard drive, a solid state storage device, a flash drive, a compact disc (CD) drive, a floppy disk drive, and a tape drive, to the computer system 2900. According to one embodiment, the data storage 2912 may be a separate server coupled to the computer system 2900 through a network connection to the I/O adapter 2910. The communications adapter 2914 may be adapted to couple the computer system 2900 to the network 2808, which may be one or more of a LAN, WAN, and/or the Internet. The communications adapter 2914 may also be adapted to couple the computer system 2900 to other networks such as a global positioning system (GPS) or a Bluetooth network. The user interface adapter 2916 couples user input devices, such as a keyboard 2920, a pointing device 2918, and/or a touch screen (not shown) to the computer system 2900. The keyboard 2920 may be an on-screen keyboard displayed on a touch panel. Additional devices (not shown) such as a camera, microphone, video camera, accelerometer, compass, and/or gyroscope may be coupled to the user interface adapter 2916. The display adapter 2922 may be driven by the CPU 2902 to control the display on the display device 2924. Any of the devices 2902-2922 may be physical and/or logical.

The applications of the present disclosure are not limited to the architecture of computer system 2900. Rather, the computer system 2900 is provided as an example of one type of computing device that may be adapted to perform the functions of a server 2802 and/or the user interface device 2810. For example, any suitable processor-based device may be utilized including, without limitation, personal data assistants (PDAs), tablet computers, smartphones, computer game consoles, and multi-processor servers. Moreover, the systems and methods of the present disclosure may be implemented on application specific integrated circuits (ASIC), very large scale integrated (VLSI) circuits, or other circuitry. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments. For example, the computer system 2900 may be virtualized for access by multiple users and/or applications. The applications could also be performed in a serverless environment, such as the cloud.

Referring to FIG. 30, an example RDMS database design 3000 is shown that can be used in implementations of the present disclosure.

If implemented in firmware and/or software, the functions described above may be stored as one or more instructions or code on a computer-readable medium. Examples include non-transitory computer-readable media encoded with a data structure and computer-readable media encoded with a computer program. Computer-readable media includes physical computer storage media. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks and Blu-ray discs. Generally, disks reproduce data magnetically, and discs reproduce data optically. Combinations of the above should also be included within the scope of computer-readable media. A serverless environment, such as the cloud, could also be used.

In addition to storage on computer readable medium, instructions and/or data may be provided as signals on transmission media included in a communication apparatus. For example, a communication apparatus may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the claims. A serverless environment, such as the cloud, could also be used.

Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present invention, disclosure, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

1. A method of processing a document corpus, the method comprising: receiving a corpus of documents; selecting a set of sentences from the corpus and creating a corpus-sentence set; processing the corpus-sentence set using BERT to create word-level association; processing the corpus-sentence set using SBERT to create sentence-level association; and creating a neural network for the corpus using the set of sentences, word-level associations and sentence-level associations.
2. The method of claim 1, further comprising creating a corpus-term set.
3. The method of claim 2, further comprising processing a group of terms and adding the term to the corpus-term set.
4. The method of claim 3, further comprising capturing a definition for the term and adding the definition to the corpus-sentence set.
5. The method of claim 4, wherein the group of terms can be a table of contents, glossary, index, explicit tables or implicit tables.
6. The method of claim 2, further comprising processing the corpus-term set to find each sentence in the document that contains a term.
7. The method of claim 6, further comprising adding each sentence that contains the term to the corpus-sentence set.
8. The method of claim 1, further comprising applying an action-trigger phrase set to the neural network to create a trained database.
9. The method of claim 8, wherein a user query can be processed against the trained database improving the speed of processing the user query.
10. The method of claim 9, wherein a cosine similarity score is computed between the query and the trigger.
11. A computer program product, comprising: a non-transitory computer readable medium comprising instructions which, when executed by a processor of a computing system, cause the processor to perform the steps of: receiving a corpus of documents; selecting a set of sentences from the corpus and creating a corpus-sentence set; processing the corpus-sentence set using BERT to create word-level association; processing the corpus-sentence set using SBERT to create sentence-level association; and creating a neural network for the corpus using the set of sentences, word-level associations and sentence-level associations.
12. The computer program product of claim 11, further comprising creating a corpus-term set.
13. The computer program product of claim 12, further comprising processing a group of terms and adding the term to the corpus-term set.
14. The computer program product of claim 13, further comprising capturing a definition for the term and adding the definition to the corpus-sentence set.
15. The computer program product of claim 14, wherein the group of terms can be a table of contents, glossary, index, explicit tables or implicit tables.
16. The computer program product of claim 12, further comprising processing the corpus-term set to find each sentence in the document that contains a term.
17. The computer program product of claim 16, further comprising adding each sentence that contains the term to the corpus-sentence set.
18. The computer program product of claim 11, further comprising applying an action-trigger phrase set to the neural network to create a trained database.
19. The computer program product of claim 18, wherein a user query can be processed against the trained database improving the speed of processing the user query.
20. The computer program product of claim 19, wherein a cosine similarity score is computed between the query and the trigger.